Over/Under Betting Strategy
logo

CHALLENGE PRESENTATION ——-

The main goal is to formulate an over/under betting strategy based on the football_example csv file.

METHODOLOGY ——-

1. Identify the type of problem : supervised learning or unsupervised learning ? regression or classification ? Define a error metric to compare models.
The challenge is a regression problem and the Mean Absolute Error (MAE) will be the metric to evaluate the performance.

2. Data understanding - missing values, variable types (One hot encoding : Many models require all variables to be numeric) - Handle missing values if necessary (delete rows or mean/median imputations).

3. Descriptives statistics - response variable distribution. Main statistics (min, max, mean, quantiles, standard deviation, mode (for categorical variables)) for each feature.

4. Correlation analysis between response variable and other numeric variables (or categorical variables after one hot encoding)

5. Data Splitting : training set (used to train algorithms and tune hyper-parameters) and testing set (used to estimate its prediction error (generalization error)). Data from test set won’t be used during model training.

6. Cross validation on the training set in order to define the best strategy.
Comparison of different algorithms (OLS, Ridge, Lasso, … Random Forest, XGBoost) by tuning the hyper-parameters in the cross validation.

7. Identify the best strategy by analyzing the cross validation errors for each model.

8. Retrain the best model (i.e model with best hyper-parameters from cross validation) on the full training set & assess performance on the testing set.
In the following code, we’ll tune some hyper-parameters in the cross validation but without seeking for the optimal combination (grid search).

9. Formulate a over/under betting strategy based on Quantile Prediction with Random Forest. Check the performance and the robustness of the strategy


LOAD PACKAGES

# Disable scientific notation 
options(scipen=999)

# Load useful packages ----

# Data manipulation
library(tidyverse)
library(data.table)
library(broom)

# String manipulation
library(stringr)

# Resample data 
library(rsample)
library(recipes)

# Correlation Analysis 
library(correlationfunnel)

# Machine learning
library(caret)
library(mlr)
library(glmnet)
library(gbm)
library(xgboost)
library(e1071)
library(neuralnet)
library(kernlab)
library(rpart)
library(ranger)
library(randomForest)
library(splines)

# Graphics 
library(ggplot2)
library(plotly)

# Data Cleaning
library(janitor)

# EDA 
library(skimr)

# Apply mapping functions 
library(purrr)
library(furrr) # in parallel 

# Rmarkdown 
library(epuRate)
library(rmarkdown)
library(knitr)
library(formattable)
library(kableExtra)

# TIming R scripts
library(tictoc)

EXPLORATION DATA ANALYSIS

# # Exploring data 
# summarize_df<- mlr::summarizeColumns(df)

# there are some algorithms which don’t require you to impute missing values
# You can simply supply them missing data. They take care of missing values on their own
# Let’s see which algorithms are they
# listLearners("regr", check.packages = TRUE, properties = "missings")[c("class","package")]

# To get the list of parameters for any algorithm
# getParamSet("regr.cvglmnet")

# Import data from Github 
link_github <- "https://raw.githubusercontent.com/benj7/betting_strategy/master/football_example.csv"
df <- fread(link_github)
# 38,810 rows and 22 variables 

# initial data backup
df_backup <- df

Descriptive Statistics

# Descriptive stats 
skim(df)
Data summary
Name df
Number of rows 38810
Number of columns 22
_______________________
Column type frequency:
character 7
numeric 15
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
market_id 0 1 11 11 0 1719 0
match_start_time_gmt 0 1 19 19 0 1285 0
league 0 1 13 14 0 2 0
home 0 1 5 24 0 47 0
away 0 1 5 24 0 47 0
bet_type 0 1 5 5 0 2 0
selection 0 1 4 4 0 2 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
league_id 0 1 34.56 0.50 34.00 34.00 35.00 35.00 35.00 ▆▁▁▁▇
elo_home_ft 0 1 1195.63 136.51 892.67 1105.89 1174.10 1256.89 1720.75 ▂▇▃▁▁
elo_home_fh 0 1 1106.59 88.32 874.44 1053.60 1095.65 1144.09 1460.90 ▁▇▅▁▁
elo_away_ft 0 1 1114.49 136.44 839.25 1016.95 1105.15 1181.36 1623.97 ▃▇▃▁▁
elo_away_fh 0 1 1061.47 86.53 851.85 1001.90 1052.93 1113.96 1397.26 ▂▇▅▂▁
home_score 0 1 0.00 0.00 0.00 0.00 0.00 0.00 0.00 ▁▁▇▁▁
away_score 0 1 0.00 0.00 0.00 0.00 0.00 0.00 0.00 ▁▁▇▁▁
observations 0 1 1480.61 896.78 467.00 467.00 2274.00 2274.00 2274.00 ▆▁▁▁▇
minute 0 1 15.00 0.00 15.00 15.00 15.00 15.00 15.00 ▁▁▇▁▁
final_home_score 0 1 1.19 1.20 0.00 0.00 1.00 2.00 9.00 ▇▃▁▁▁
final_away_score 0 1 0.90 1.03 0.00 0.00 1.00 1.00 6.00 ▇▂▁▁▁
handicap 0 1 2.06 1.05 0.50 1.25 2.00 2.75 7.00 ▇▆▂▁▁
league_prob_at_least 0 1 0.44 0.22 0.00 0.28 0.40 0.57 0.90 ▅▂▇▇▂
odds 0 1 1.24 1.02 0.06 0.53 0.95 1.60 7.38 ▇▂▁▁▁
profit 0 1 -0.04 0.99 -1.00 -1.00 0.00 0.58 7.07 ▇▂▁▁▁



It’s very interesting to notice that the average profit is negative: -0.04. If we decide to place all bets, the overall profit is negative, equals to -1721.

Variables “home_score” and “away_score” which are the score at the 15th minute for home and away teams are always equal to 0. This variable is not reliable enough to be used for modelling.

Response Variable Distribution

plot_ly(data = df, 
        x = ~profit,
        type = "histogram",
        histnorm = "probability") %>%  
  layout(title = "Frequency distribution of Profit",
         xaxis = list(title = "Profit",
                      zeroline = FALSE),
         yaxis = list(title = "Frequency",
                      zeroline = FALSE))



Around 54% of the bets have a profit negative or equals to 0.
The response variable is skewed but we can’t consider log transformation since the target can be negative

CORRELATION ANALYSIS


Correlation Analysis on data to identify key features that relate to the target (profit)

# Feature Engineering ----

df <- df %>% mutate(
  
  # Rename modalities of the variable "selection"
  selection = case_when(selection == "home" ~ "over",
                        selection == "away" ~ "under"),
  
  # Selection x Handicap 
  selection_handicap = str_c(selection,
                             handicap,
                             sep = "_"),
  
  # Compute difference between elo_home and elo_away
  elo_diff_ft = elo_home_ft - elo_away_ft,
  
  elo_diff_fh = elo_home_fh - elo_away_fh,
  
  # Compute difference between league_prod_at_least and implied probability from odds 
  diff_probs = league_prob_at_least - 1/(odds+1),
  
  # Total number of goals
  nb_goals = final_home_score + final_away_score,
  
  # Is profit positive? 1: Yes - 0:No
  positive_profit = if_else(profit > 0, 1, 0),
  
  # Ratio between profit and odds 
  profit_odd_ratio = profit/odds)

df <- as.data.table(df)

df <- df[profit < 0, profit_odd_ratio := profit]

# Round numeric variables to 3 decimal
col_num <- which(sapply(df,class) == "numeric")

sdcols <- names(df)[col_num]
  
df_num <- df[, lapply(.SD, function(x) round(x,3)), .SDcols = sdcols]

df_no_num <- df[, .SD, .SDcols = setdiff(names(df), sdcols)]

df <- cbind(df_no_num, df_num)

# Correlation Analysis - Identify key features that relate to the target ("profit")

# Transform the data into a binary format 
df_binarized <- df %>%
  # remove useless variables with no information or variables that we don't know before betting 
  select(-c(market_id, 
            match_start_time_gmt, 
            league_id, 
            home, 
            away,
            home_score,
            away_score, 
            observations,
            minute,
            final_home_score,
            final_away_score,
            minute,
            nb_goals,
            selection_handicap, # interesting but makes the chart unreadable 
            profit,
            profit_odd_ratio)) %>%
  binarize(n_bins = 5 , thresh_infreq = 0.01, name_infreq = "OTHER", one_hot = TRUE)

# Perform correlation analysis 
df_binarized_corrl <- df_binarized %>%
  correlate(positive_profit__1)

# Visualize the feature-target relationships
df_binarized_corrl %>%
  plot_correlation_funnel()



Low odds are logically more correlated with a positive profit. According to this analysis, the “diff_probs” feature seems to be correlated with the profit - that’s interesting for modelling purposes!

Handicap between 1-1.75 and Under bets are slightly more positively correlated with positive profit

It makes more sense to analyze the variable “selection x handicap” but that unfortunately makes the chart unreadable

# Remove useless variables with no information or variables that we don't know before betting 
df <- as.data.table(df)

features_to_keep <- setdiff(names(df),c("market_id",
                                        "match_start_time_gmt",
                                        "league_id", 
                                        "home", 
                                        "away",
                                        "home_score",
                                        "away_score", 
                                        "observations",
                                        "minute",
                                        "final_home_score",
                                        "final_away_score",
                                        "minute",
                                        "nb_goals",
                                        "selection_handicap",
                                        "positive_profit",
                                        "profit_odd_ratio"))

df_model <- df[, .SD, .SDcols = features_to_keep]

# one-hot encode our categorical variables
one_hot <- dummyVars(~ ., df_model, fullRank = FALSE)
df_model_hot <- predict(one_hot, df_model) %>% as.data.frame()

MODELING

Data Splitting

# Response variable renaming  
df_model_hot <- df_model_hot %>%
  rename(Y = profit)

# Separate into random training and test data sets ----

# set.seed() for reproducibility 
set.seed(42) 

# We keep 80% of the data for model training 
split_strat <- initial_split(df_model_hot, prop = 0.8)
df_train  <- training(split_strat)
df_test   <- testing(split_strat)



We implement a 5-fold cross validation manually (for loop) on the training set. This gives us a lot of flexibility and provides a good understanding of what goes on behind the scenes.

For large-scale machine learning projects, I use H2O which automates the cross validation and hyperparameter tuning process. It’s much faster and more scalable since H2O is written in Java.

5-fold Cross Validation

# Training set 
X <- as.data.table(df_train %>% select(-Y))
Y <- df_train$Y

XY <- cbind(Y,X)
names(XY)[1] <- "Y"

# Testing set 
X_test <- as.data.table(df_test %>% select(-Y))
Y_test <- df_test$Y

XY_test <- cbind(Y_test, X_test)
names(XY_test)[1] <- "Y"

# Cross validation 
# Higher CV metrics determine that our model does not suffer from high variance and generalizes well on unseen data

# Number of folds
N <- 5

# Folds building
set.seed(42)
Folds <- sample(1:dim(XY)[1] %% N + 1)

Output <- list()
Time <- list()

i <- 1

# for loop to perform cross validation 
for(i in 1:N){
  Predictions <- list()
  
  # Estimation samples
  XE <- X[Folds != i]
  YE <- Y[Folds != i]
  # Data needs to be in a matrix format for some models 
  XME <- as.matrix(XE)
  YME <- as.matrix(YE)
  XYE <- XY[Folds != i]
  
  # Prediction samples 
  XP <- X[Folds == i]
  YP <- Y[Folds == i]
  # Data needs to be in a matrix format for some models 
  XMP <- as.matrix(XP)
  YMP <- as.matrix(YP)
  XYP <- XY[Folds == i]
  
  print(stringr::str_c("Fold number ", i))
  print("**************************************************")
  
  # Add the Y variable to the prediction dataset (for comparison purposes)
  Predictions[["Y"]] <- YP
  Predictions[["Y"]] <- as.numeric(Predictions[["Y"]])
  
  #********************* ORDINARY LEAST SQUARES **********************#
  
  print(Nom <- "OLS (LM)")
  Start <- proc.time()
  Model <- stats::lm(formula = Y ~ ., data = XYE)
  Response <- predict(Model, newdata = XP, type = "response")
  Predictions[[Nom]] <- as.numeric(Response)
  Time[[Nom]] <- ifelse(is.null(Time[[Nom]]), 0, Time[[Nom]]) + (proc.time() - Start)[3]
  
  #********************* REGULARIZED REGRESSIONS *********************#
  
  print(Nom <- "OLS (GLMNET)")
  Start <- proc.time()
  Model <- glmnet::glmnet(x = XME, y = YME, lambda = 0, alpha = 0, family = "gaussian")
  Response <- predict(Model, newx = XMP, type = "response")
  Predictions[[Nom]] <- as.numeric(Response)
  Time[[Nom]] <- ifelse(is.null(Time[[Nom]]), 0, Time[[Nom]]) + (proc.time() - Start)[3]
  
  print(Nom <- "Ridge (GLMNET)")
  Start <- proc.time()
  Model <- glmnet::cv.glmnet(x = XME, y = YME, alpha = 0, family = "gaussian")
  Response <- predict(Model, s = "lambda.min", newx = XMP, type = "response")
  Predictions[[Nom]] <- as.numeric(Response)
  Time[[Nom]] <- ifelse(is.null(Time[[Nom]]), 0, Time[[Nom]]) + (proc.time() - Start)[3]
  
  # Regularized Regression to reduce the model variance 
  # alpha = 1 for lasso (push some coefficients to 0 for some variables)
  # alpha = 0 for ridge (no coefficients at 0)
  
  # λ is a tuning parameter that helps to control our model from over-fitting to the training data
  # However, to identify the optimal λ value we need to perform cross-validation (CV)
  # cv.glmnet provides a built-in option to perform k-fold CV, and by default, performs 10-fold CV
  
  # the 10-fold CV mean squared error (MSE) across the λ values
  
  # number of the top = number of variables 
  # The first vertical dashed lines represent the λ value with the minimum MSE 
  # and the second one the largest λ value within one standard error of the minimum MSE
  
  # min(Model$cvm)       # minimum MSE
  # Model$lambda.min     # lambda for this min MSE
  
  # Model$cvm[Model$lambda == Model$lambda.1se]  # 1 st.error of min MSE
  
  #  Model$lambda.1se  # lambda for this MSE
  
  # Influential variables 
  
  # coef(Model, s = "lambda.1se") %>%
  #   tidy() %>%
  #   filter(row != "(Intercept)") %>%
  #   ggplot(aes(value, reorder(row, value), color = value > 0)) +
  #   geom_point(show.legend = FALSE) +
  #   ggtitle("Influential variables") +
  #   xlab("Coefficient") +
  #   ylab(NULL)
  
  print(Nom <- "Lasso (GLMNET)")
  Start <- proc.time()
  Model <- glmnet::cv.glmnet(x = XME, y = YME, alpha = 1, family = "gaussian")
  Response <- predict(Model, s = "lambda.1se", newx = XMP, type = "response")
  Predictions[[Nom]] <- as.numeric(Response)
  Time[[Nom]] <- ifelse(is.null(Time[[Nom]]), 0, Time[[Nom]]) + (proc.time() - Start)[3]
  
  # coef(Model, s=Model$lambda.1se)
  
  Alphas <- c(0.25, 0.5, 0.75)
  
  for(Alpha in Alphas){
    print(Nom <- stringr::str_c("Elastic net (GLMNET|alpha=", Alpha, ")"))
    Start <- proc.time()
    Model <- glmnet::cv.glmnet(x = XME, y = YME, alpha = Alpha, family = "gaussian")
    Response <- predict(Model, s = "lambda.min", newx = XMP, type = "response")
    Predictions[[Nom]] <- as.numeric(Response)
    Time[[Nom]] <- ifelse(is.null(Time[[Nom]]), 0, Time[[Nom]]) + (proc.time() - Start)[3]
  }
  
  #********************* TREE *******************************#
  
  print(Nom <- "Tree (RPART)")
  Start <- proc.time()
  Model <- rpart::rpart(formula = Y ~ ., data = XYE)
  Response <- predict(Model, newdata = XP, type = "vector")
  Predictions[[Nom]] <- as.numeric(Response)
  Time[[Nom]] <- ifelse(is.null(Time[[Nom]]), 0, Time[[Nom]]) + (proc.time() - Start)[3]
  
  # rpart.plot(Model)
  
  #********************* RANDOM FORESTS *********************#
  # tune the number of tree hyperparameter in the cross validation 
  Numbers_Of_Trees <- c( 100, 250, 500)
  
  # make ranger compatible names
  names(XYE) <- make.names(names(XYE), allow = FALSE)
  names(XP) <- make.names(names(XP), allow = FALSE)
  
  for (ntree in Numbers_Of_Trees) {
    
    print(Nom <- stringr::str_c("Random forest (RANDOMFOREST|ntree=", ntree, ")"))
    Start <- proc.time()
    features <- setdiff(names(XYE), "Y")
    Model <- ranger(formula    = Y ~ .,
                    data       = XYE,
                    num.trees  = ntree, # number of trees 
                    mtry       = floor(length(features) / 3), # the number of variables 
                    # to randomly sample as candidates at each split
                    # usal value for regression = p/3 with p the total number of variables  
                    respect.unordered.factors = 'order',
                    verbose    = FALSE,
                    seed       = 42)
    Response <- predict(Model, data = XP, type = "response")
    Response <- Response$predictions
    Predictions[[Nom]] <- as.numeric(Response)
    Time[[Nom]] <- ifelse(is.null(Time[[Nom]]), 0, Time[[Nom]]) + (proc.time() - Start)[3]
  }
  
  #********************* BOOSTING *************************#
  
  # Xgboost - apprx 10x faster than gbm
  # For boosting tree models, the most important parameter is the number of trees to build - nrounds parameter
  # XGBoost only works with matrices that contain all numeric variables
  
  # tune the number of tree hyperparameter in the cross validation 
  Numbers_Of_Trees <-  c(100, 250, 500)
  
  for (nrounds in Numbers_Of_Trees){
    print(Nom <- stringr::str_c("XGBoost|n.trees=", nrounds))
    Model <- xgboost(data = XME, 
                     label = YME,
                     nrounds = nrounds,
                     eta = 0.1, # learning rate 0.3 by default
                     max_depth = 4, #  Maximum depth of a tree. 6 by default
                     objective = "reg:linear",
                     print_every = 1000) # hide intermediate results 
    Response <- predict(Model , XMP, missing = NA)
    Predictions[[Nom]] <- as.numeric(Response)
    Time[[Nom]] <- ifelse(is.null(Time[[Nom]]), 0, Time[[Nom]]) + (proc.time() - Start)[3]
  }
  
  # #********************* SVM *****************************#
  # NOT RUN 
  # tune Kernel and Cost hyper-parameters in the cross validation 
  # Kernels <- c("linear", "polynomial", "radial")
  # Costs <- c(0.1, 1, 2)
  # 
  # for(Kernel in Kernels)
  # {
  #   for(Cost in Costs)
  #   {
  #     print(Nom <- stringr::str_c("SVM (e1071|kernel=", Kernel, "|cost=", Cost, ")"))
  #     Start <- proc.time()
  #     Model <- e1071::svm(Y ~ ., data = XYE, kernel = Kernel, cost = Cost)
  #     Response <- predict(Model, newdata = XP, type = "response")
  #     Predictions[[Nom]] <- as.numeric(Response)
  #     Time[[Nom]] <- ifelse(is.null(Time[[Nom]]), 0, Time[[Nom]]) + (proc.time() - Start)[3]
  #   }
  # }
  
  Output[[1 + length(Output)]] <- as.data.table(do.call(cbind, Predictions))
  
  print("**************************************************")
}
## [1] "Fold number 1"
## [1] "**************************************************"
## [1] "OLS (LM)"
## [1] "OLS (GLMNET)"
## [1] "Ridge (GLMNET)"
## [1] "Lasso (GLMNET)"
## [1] "Elastic net (GLMNET|alpha=0.25)"
## [1] "Elastic net (GLMNET|alpha=0.5)"
## [1] "Elastic net (GLMNET|alpha=0.75)"
## [1] "Tree (RPART)"
## [1] "Random forest (RANDOMFOREST|ntree=100)"
## [1] "Random forest (RANDOMFOREST|ntree=250)"
## [1] "Random forest (RANDOMFOREST|ntree=500)"
## [1] "XGBoost|n.trees=100"
## [1]  train-rmse:1.103780 
## [100]    train-rmse:0.940591 
## [1] "XGBoost|n.trees=250"
## [1]  train-rmse:1.103780 
## [250]    train-rmse:0.882290 
## [1] "XGBoost|n.trees=500"
## [1]  train-rmse:1.103780 
## [500]    train-rmse:0.822100 
## [1] "**************************************************"
## [1] "Fold number 2"
## [1] "**************************************************"
## [1] "OLS (LM)"
## [1] "OLS (GLMNET)"
## [1] "Ridge (GLMNET)"
## [1] "Lasso (GLMNET)"
## [1] "Elastic net (GLMNET|alpha=0.25)"
## [1] "Elastic net (GLMNET|alpha=0.5)"
## [1] "Elastic net (GLMNET|alpha=0.75)"
## [1] "Tree (RPART)"
## [1] "Random forest (RANDOMFOREST|ntree=100)"
## [1] "Random forest (RANDOMFOREST|ntree=250)"
## [1] "Random forest (RANDOMFOREST|ntree=500)"
## [1] "XGBoost|n.trees=100"
## [1]  train-rmse:1.105421 
## [100]    train-rmse:0.935720 
## [1] "XGBoost|n.trees=250"
## [1]  train-rmse:1.105421 
## [250]    train-rmse:0.890443 
## [1] "XGBoost|n.trees=500"
## [1]  train-rmse:1.105421 
## [500]    train-rmse:0.823558 
## [1] "**************************************************"
## [1] "Fold number 3"
## [1] "**************************************************"
## [1] "OLS (LM)"
## [1] "OLS (GLMNET)"
## [1] "Ridge (GLMNET)"
## [1] "Lasso (GLMNET)"
## [1] "Elastic net (GLMNET|alpha=0.25)"
## [1] "Elastic net (GLMNET|alpha=0.5)"
## [1] "Elastic net (GLMNET|alpha=0.75)"
## [1] "Tree (RPART)"
## [1] "Random forest (RANDOMFOREST|ntree=100)"
## [1] "Random forest (RANDOMFOREST|ntree=250)"
## [1] "Random forest (RANDOMFOREST|ntree=500)"
## [1] "XGBoost|n.trees=100"
## [1]  train-rmse:1.105124 
## [100]    train-rmse:0.938242 
## [1] "XGBoost|n.trees=250"
## [1]  train-rmse:1.105124 
## [250]    train-rmse:0.883623 
## [1] "XGBoost|n.trees=500"
## [1]  train-rmse:1.105124 
## [500]    train-rmse:0.815723 
## [1] "**************************************************"
## [1] "Fold number 4"
## [1] "**************************************************"
## [1] "OLS (LM)"
## [1] "OLS (GLMNET)"
## [1] "Ridge (GLMNET)"
## [1] "Lasso (GLMNET)"
## [1] "Elastic net (GLMNET|alpha=0.25)"
## [1] "Elastic net (GLMNET|alpha=0.5)"
## [1] "Elastic net (GLMNET|alpha=0.75)"
## [1] "Tree (RPART)"
## [1] "Random forest (RANDOMFOREST|ntree=100)"
## [1] "Random forest (RANDOMFOREST|ntree=250)"
## [1] "Random forest (RANDOMFOREST|ntree=500)"
## [1] "XGBoost|n.trees=100"
## [1]  train-rmse:1.102575 
## [100]    train-rmse:0.935799 
## [1] "XGBoost|n.trees=250"
## [1]  train-rmse:1.102575 
## [250]    train-rmse:0.881310 
## [1] "XGBoost|n.trees=500"
## [1]  train-rmse:1.102575 
## [500]    train-rmse:0.815089 
## [1] "**************************************************"
## [1] "Fold number 5"
## [1] "**************************************************"
## [1] "OLS (LM)"
## [1] "OLS (GLMNET)"
## [1] "Ridge (GLMNET)"
## [1] "Lasso (GLMNET)"
## [1] "Elastic net (GLMNET|alpha=0.25)"
## [1] "Elastic net (GLMNET|alpha=0.5)"
## [1] "Elastic net (GLMNET|alpha=0.75)"
## [1] "Tree (RPART)"
## [1] "Random forest (RANDOMFOREST|ntree=100)"
## [1] "Random forest (RANDOMFOREST|ntree=250)"
## [1] "Random forest (RANDOMFOREST|ntree=500)"
## [1] "XGBoost|n.trees=100"
## [1]  train-rmse:1.103063 
## [100]    train-rmse:0.932974 
## [1] "XGBoost|n.trees=250"
## [1]  train-rmse:1.103063 
## [250]    train-rmse:0.885102 
## [1] "XGBoost|n.trees=500"
## [1]  train-rmse:1.103063 
## [500]    train-rmse:0.819950 
## [1] "**************************************************"

Cross Validation Errors

# Stack all the outputs tables up 
Output <- data.table::rbindlist(l = Output, use.names = TRUE, fill = TRUE)

# Measure execution time 
Time <- data.table(Model = names(Time), Time = unlist(Time))

# Function to define the error metric : Mean Absolute Error 
loss_function <- function(X,Y){
  mean(abs(X-Y))
}

# Assess cross validation errors for each model 
Errors <- apply(X = Output, MARGIN = 2, FUN = loss_function, Y = Output$Y)
Errors <- data.table(Model = names(Errors), error_rate = Errors)
Errors <- merge(x = Errors, y = Time, by = "Model", all.x = TRUE, all.y = FALSE)
Errors <- Errors[order(error_rate)]
Errors



After analysis of the cross validation errors, the best model is Random Forest with 500 trees.

Let’s retrain on the full training set and predict on the testing set.

PREDICT ON TEST SET

Retrain on the full training set

# Training set 
X <- as.data.table(df_train %>% select(-Y))
Y <- df_train$Y

XY <- cbind(Y,X)
names(XY)[1] <- "Y"

# Testing set 
X_test <- as.data.table(df_test %>% select(-Y))
Y_test <- df_test$Y

XY_test <- cbind(Y_test, X_test)
names(XY_test)[1] <- "Y"

# make ranger compatible names
names(XY) <- make.names(names(XY), allow = FALSE)
names(X_test) <- make.names(names(X_test), allow = FALSE)

features <- setdiff(names(XY), "Y")

# Best Model: Random Forest 500 trees according to the the cross validation errors 

Model_final<- ranger(formula    = Y ~ .,
                     data       = XY,
                     num.trees  = 500, 
                     mtry       = floor(length(features) / 3), 
                     respect.unordered.factors = 'order',
                     # To plot variable imortance 
                     importance = "impurity",
                     # Quantile regressions forest to assess prediction intervals
                     quantreg   = TRUE,
                     verbose    = FALSE,
                     seed       = 42)

Performance on the test set

# Predict on the test set
predictions_testRF <- predict(Model_final,
                              data = X_test, 
                              type = "response")

predictions_test <- data.frame(cbind(Y_test, predictions_testRF$predictions))

# Performance on the test set
loss_function(X = predictions_testRF$predictions, Y = Y_test)
## [1] 0.5556321



The Mean Absolute Error equals to 0.56 on the test set.

Let’s determine the main discriminant variables by plotting the variable importance.

Variable Importance

#  Plot Top important variables 

var_imp_ranger<-as.data.frame(Model_final$variable.importance)
variables <-(as.vector((row.names(var_imp_ranger))))
var_imp_ranger <- cbind(variables,var_imp_ranger)
var_imp_ranger <- var_imp_ranger %>% 
  arrange(desc(`Model_final$variable.importance`))


p <- ggplot(var_imp_ranger,
       aes(x = reorder(variables,`Model_final$variable.importance`) ,
           y = `Model_final$variable.importance`, 
           fill = `Model_final$variable.importance`))+ 
  geom_bar(stat="identity",
           position="dodge",
           aes(text = `Model_final$variable.importance`))+
  coord_flip()+
  ylab("Importance")+
  xlab("")+
  ggtitle("Variable Importance - Random Forest 500 Trees")+
  guides(fill = F)+
  scale_fill_gradient()+
  theme_classic()

ggplotly(p, tooltip = "text")



The most important variables are odds, elo_* variables and diff.probs. Our feature engineering has therefore increased the model performance!

More important than the average values given by the Random Forest, we will implement Quantile Prediction (Quantile Regression Forest) to get an idea about the prediction intervals. We will use these predicted quantiles to determine a betting strategy thereafter.

BETTING STRATEGY

Quantile Prediction

# Quantile Prediction
quantiles_reg_forests <- predict(Model_final, 
                                 data = X_test,
                                 type = "quantiles",
                                 quantiles = seq(0, 1, by = 0.1))

pred_intervals <- as.data.table(quantiles_reg_forests$predictions)

# Concatenate actual value (profit) and predicted quantiles  
actual_profit_w_pred_intervals <- data.table(cbind(Y_test, pred_intervals))

ncols <- ncol(actual_profit_w_pred_intervals)

names(actual_profit_w_pred_intervals)[2:ncols] <- c("Q0", "Q10","Q20", "Q30", "Q40", "Q50",
                                                    "Q60", "Q70","Q80", "Q90", "Q100")

# Display the five first rows 
actual_profit_w_pred_intervals %>% head(5)



The Y_test value is the actual profit of the bet (test set). The other variables are the different predicted quantiles (deciles in our case) with the Random Forest Model.

Quantile-based Decision Rule to place bets



Our betting strategy is based on Quantile Prediction. Given a quantile (eg quantile 10%), the strategy considers placing only bets where the predicted quantile (eg quantile 10%) values are greater than 0. It implies that there’s a probability of 90% (quantile 10%) that the profit is positive.

# Function to get the number of bets placed, the proportion of winning/losing bets
# and the overall profit for each quantile-based decision rule 

summarise_bets_placed <- function(quantile){
  
  dt <- copy(actual_profit_w_pred_intervals)
  
  # Distinguish positive profits from negative profits  
  dt  <- dt[, positive_profit := ifelse(Y_test > 0, 1, 0)]
  
  dt <-  dt[get(quantile) > 0,
            .(nb_bets_placed = .N,
              sum_profit = sum(Y_test)),
            by = .(positive_profit)]
  
  dt$decision_rule_quantile <- quantile
  
  dt  <- dt[,
            c("prop_bets", 
              "sum_nb_bets_placed",
              "sum_profit_quantile") := list(nb_bets_placed/sum(nb_bets_placed),
                                             sum(nb_bets_placed),
                                             sum(sum_profit)),
            by= .(decision_rule_quantile)]
  
  sdcols <- c("decision_rule_quantile", "positive_profit", "nb_bets_placed",
              "prop_bets", "sum_nb_bets_placed", "sum_profit", "sum_profit_quantile")
  
  dt <- dt[, .SD, .SDcols = sdcols]
  
  return(dt)
  
}

quantiles <- list("Q10", "Q20", "Q30", "Q40", "Q50",
                  "Q60", "Q70", "Q80", "Q90", "Q100")

bets_placed_quantile <- map(quantiles,
                            summarise_bets_placed)

bets_placed_quantile <- rbindlist(bets_placed_quantile)

bets_placed_quantile



This table give us the number of bets to be placed, the proportion of winning/losing bets and the overall profit for each quantile-based decision rule.


Interpretation

Example_1: Quantile 10%-based Decision Rule. Our strategy is to bet over only if Quantile 10% > 0.
Only 135 bets will be placed. They are all winning bets with an overall profit of 39.9

Example_2: Quantile 60%-based Decision Rule. Our strategy is to bet over only if Quantile 60% > 0.
4125 bets will be placed. 3385 bets are winning and 740 losing. But the overall profit (2054.3) is higher than in Example_1

Example_3: Quantile 100%-based Decision Rule. Our strategy is to bet over if Quantile 100% > 0.
Thus, all the bets will be placed. There is no selection! 7761 bets placed with 3660 winning bets and 4101 losing bets. The overall profit is negative: -310.5

According to his risk aversion, the gambler can choose between:

1. Mitigate risk by placing very few bets and securing the profits - Decision Rule: Quantile 10% > 0

2. Try to optimize the overall profit - Decsion Rule: Quantile 50% > 0 or Quantile 60% > 0. In such a case, there will be losing bets but the overall profit seems to be higher.

Let’s apply our strategy on another random sample from test set to check the robustness of the strategy

Strategy Robustness

# Random sample from test set
set.seed(42) # for reproducibility 

size <- 2500
ind <- sample(1:nrow(X_test), size)
X_test_samp <- X_test[ind]
Y_test_samp <- Y_test[ind]
sum(Y_test_samp)
## [1] -61.284
# Predict on the test set
quantiles_reg_forests <- predict(Model_final, 
                                 data = X_test_samp,
                                 type = "quantiles",
                                 quantiles = seq(0, 1, by = 0.1))

pred_intervals <- as.data.table(quantiles_reg_forests$predictions)

actual_profit_w_pred_intervals <- data.table(cbind(Y_test_samp, pred_intervals))

names(actual_profit_w_pred_intervals) <- c("Y_test","Q0", "Q10","Q20", "Q30", "Q40",
                                           "Q50","Q60", "Q70","Q80", "Q90", "Q100")

quantiles <- list("Q10", "Q20", "Q30", "Q40", "Q50",
                  "Q60", "Q70", "Q80", "Q90", "Q100")

bets_placed_quantile_samp <- map(quantiles,
                                 summarise_bets_placed)

bets_placed_quantile_samp <- rbindlist(bets_placed_quantile_samp)

bets_placed_quantile_samp



Quantile 10%-based Decision Rule.
There are 49 bets placed (over 2500 possible bets) and an overall profit of 14.6

Here again, the optimal strategy seems to be the Quantile 60%-based Decision Rule.
1302 bets placed: 1091 winning and 211 losing bets. The overall profit is 693.3

Out of intellectual curiosity, let’s consider the worst case scenario: what if we only keep bets with negative profit?
What would be the results of our strategy based on quantiles in such a case?

Worst-Case Scenario

# Worst-Case Scenario - we only keep bets with negative profit  
ind <- which(Y_test < 0)
X_test_wc <-  X_test[ind]
Y_test_wc <- Y_test[ind]

# Quuntile Predictions with Random Forest  
quantiles_reg_forests <- predict(Model_final, 
                                 data = X_test_wc,
                                 type = "quantiles",
                                 quantiles = seq(0, 1, by = 0.1))

pred_intervals <- as.data.table(quantiles_reg_forests$predictions)

actual_profit_w_pred_intervals <- data.table(cbind(Y_test_wc, pred_intervals))

names(actual_profit_w_pred_intervals) <- c("Y_test","Q0", "Q10","Q20", "Q30", "Q40",
                                           "Q50","Q60", "Q70","Q80", "Q90", "Q100")

quantiles <- list("Q10", "Q20", "Q30", "Q40", "Q50", 
                  "Q60", "Q70", "Q80", "Q90", "Q100")

bets_placed_quantile_wc <- map(quantiles,
                               summarise_bets_placed)

bets_placed_quantile_wc <- rbindlist(bets_placed_quantile_wc)

bets_placed_quantile_wc



In this worst case, bets are only placed from the quantile 30%-based Decision Rule. Thus, a risk-averse gambler won’t play and then won’t lose money.

Our strategy can therefore be used to mitigate risk and avoid significant financial losses.

 




 

Back to top